2015 Contribution to conference Open Access

Low-quality training data in information extraction
Marcheggiani D., Sebastiani F.
In the last five years there has been a flurry of work on information extraction, i.e., on algorithms capable of extracting, from informal and unstructured texts, mentions of concepts relevant to a given application. Most of this literature is about methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate more and more accurate information extractors, no work has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data often derives from the fact that the person who has annotated the data is different from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems as applied to the clinical domain. We do this by comparing the accuracy deriving from training data annotated by the authoritative coder (i.e., the one who has also annotated the test data, and by whose judgment we must abide), with the accuracy deriving from training data annotated by a different coder.
Source: Machine Learning and Data Analytics Symposium, Doha, Qatar, 8-9 March 2015

See at: ISTI Repository Open Access | CNR ExploRA

2015 Conference article Restricted

Word embeddings go to Italy: A comparison of models and training datasets
Berardi G., Esuli A., Marcheggiani D.
In this paper we present some preliminary results on the generation of word embeddings for the Italian language. We compare two popular word representation models, word2vec and GloVe, and train them on two datasets with different stylistic properties. We test the generated word embeddings on a word analogy test derived from the one originally proposed for word2vec, adapted to capture some of the linguistic aspects that are specific to Italian. Results show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English. Moreover, we have found that the stylistic properties of the training dataset play a relevant role in the type of information captured by the produced vectors.
Source: 6th Italian Information Retrieval Workshop, Cagliari, 25-26/05/2015

See at: ceur-ws.org Restricted | CNR ExploRA

2015 Report Open Access

ISTI young research award 2015
Bardi A., Candela L., Coro G., Dellepiane M., Esuli A., Gabrielli L., Gotta A., Lucchese C., Marcheggiani D., Nardini F. M., Palumbo F., Pietroni N., Rossetti G.
The ISTI Young Researcher Award is an award for young people of the Institute of Information Science and Technologies (ISTI) with high scientific production. In particular, the award is granted to young staff members (less than 35 years old) by assessing the yearly scientific production of the year preceding the award. This report documents the procedure and results of the 2015 edition of the award.
Source: ISTI Technical reports, 2015

See at: ISTI Repository Open Access | CNR ExploRA

2015 Conference article Open Access

A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining
Jimenez Zafra S., Berardi G., Esuli A., Marcheggiani D., Martin-Valdivia M. T., Moreo Fernández A.
We present the Trip-MAML dataset, a Multi-Lingual dataset of hotel reviews that have been manually annotated at the sentence-level with Multi-Aspect sentiment labels. This dataset has been built as an extension of an existing English-only dataset, adding documents written in Italian and Spanish. We detail the dataset construction process, covering the data gathering, selection, and annotation. We present inter-annotator agreement figures and baseline experimental results, comparing the three languages. Trip-MAML is a multi-lingual dataset for aspect-oriented opinion mining that enables researchers (i) to face the problem on languages other than English and (ii) to experiment with the application of cross-lingual learning methods to the task.
Source: Conference on Empirical Methods in Natural Language Processing, pp. 2533–2538, Lisbon, 17-21/09/2015
DOI: 10.18653/v1/d15-1302

2015 Conference article Open Access

On the impact of Entity Linking in microblog real-time filtering
Berardi G., Ceccarelli D., Esuli A., Marcheggiani D.
Microblogging is a model of content sharing in which the temporal locality of posts with respect to important events, either of foreseeable or unforeseeable nature, makes applications of real-time filtering of great practical interest. We propose the use of Entity Linking (EL) in order to improve the retrieval effectiveness, by enriching the representation of microblog posts and filtering queries. EL is the process of recognizing in an unstructured text the mention of relevant entities described in a knowledge base. EL of short pieces of text is a difficult task, but it is also a scenario in which the information EL adds to the text can have a substantial impact on the retrieval process. We implement a state-of-the-art filtering method, based on the best systems from the TREC Microblog track real-time adhoc retrieval and filtering tasks, and extend it with a Wikipedia-based EL method. Results show that the use of EL significantly improves over non-EL based versions of the filtering methods.
Source: SAC'15 - 30th Annual ACM Symposium on Applied Computing, pp. 1066–1071, Salamanca, Spain, 13-17 April 2015
DOI: 10.1145/2695664.2695761
DOI: 10.48550/arxiv.1611.03350

2014 Contribution to conference Open Access

On the effects of low-quality training data on information extraction from clinical reports
Marcheggiani D., Sebastiani F.
In the last five years there has been a flurry of work on information extraction from clinical documents, i.e., on algorithms capable of extracting, from the informal and unstructured texts that are generated during everyday clinical practice, mentions of concepts relevant to such practice. Most of this literature is about methods based on supervised learning, i.e., methods for training an information extraction system from manually annotated examples. While a lot of work has been devoted to devising learning methods that generate more and more accurate information extractors, little work (if any) has been devoted to investigating the effect of the quality of training data on the learning process. Low quality in training data sometimes derives from the fact that the person who has annotated the data is different (e.g., more junior) from the one against whose judgment the automatically annotated data must be evaluated. In this paper we test the impact of such data quality issues on the accuracy of information extraction systems oriented to the clinical domain. We do this by comparing the accuracy deriving from training data annotated by the authoritative coder (i.e., the one who has annotated the test data), with the accuracy deriving from training data annotated by a different coder. The results indicate that, although the disagreement between the two coders (as measured on the training set) is substantial, the difference in accuracy is not so. This hints at the fact that current learning technology is robust to the use of training data of suboptimal quality.
Source: IIR 2014 - 5th Italian Information Retrieval Workshop, Roma, Italy, 20-21 January 2014

See at: ISTI Repository Open Access | CNR ExploRA

2014 Conference article Open Access

Hierarchical multi-label conditional random fields for aspect-oriented opinion mining
Marcheggiani D., Tackstrom O., Esuli A., Sebastiani F.
A common feature of many online review sites is the use of an overall rating that summarizes the opinions expressed in a review. Unfortunately, these document-level ratings do not provide any information about the opinions contained in the review that concern a specific aspect (e.g., cleanliness) of the product being reviewed (e.g., a hotel). In this paper we study the finer-grained problem of aspect-oriented opinion mining at the sentence level, which consists of predicting, for all sentences in the review, whether the sentence expresses a positive, neutral, or negative opinion (or no opinion at all) about a specific aspect of the product. For this task we propose a set of increasingly powerful models based on conditional random fields (CRFs), including a hierarchical multi-label CRFs scheme that jointly models the overall opinion expressed in the review and the set of aspect-specific opinions expressed in each of its sentences. We evaluate the proposed models against a dataset of hotel reviews (which we here make publicly available) in which the set of aspects and the opinions expressed concerning them are manually annotated at the sentence level. We find that both hierarchical and multi-label factors lead to improved predictions of aspect-oriented opinions.
Source: ECIR 2014 - Advances in Information Retrieval. 36th European Conference on Information Retrieval, pp. 273–285, Amsterdam, The Netherlands, 13-16 April 2014
DOI: 10.1007/978-3-319-06028-6_23

2014 Conference article Open Access

An Experimental Comparison of Active Learning Strategies for Partially Labeled Sequences
Marcheggiani D., Artières T.
Active learning (AL) consists of asking human annotators to annotate automatically selected data that are assumed to bring the most benefit in the creation of a classifier. AL makes it possible to learn accurate systems with much less annotated data than what is required by pure supervised learning algorithms, hence limiting the tedious effort of annotating a large collection of data. We experimentally investigate the behavior of several AL strategies for sequence labeling tasks (in a partially-labeled scenario) tailored to Partially-Labeled Conditional Random Fields, on four sequence labeling tasks: phrase chunking, part-of-speech tagging, named-entity recognition, and bio-entity recognition.
Source: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 898–906, Doha, Qatar, 25-29/10/2014

See at: aclweb.org Open Access | CNR ExploRA

2014 Doctoral thesis Unknown

Beyond linear chain: a journey through conditional random fields for information extraction from text
Marcheggiani D.
Natural language, spoken and written, is the most important way for humans to communicate information to each other. In the last decades natural language processing (NLP) researchers have studied methods aimed at making computers "understand" the information enclosed in human language. Information Extraction (IE) is a field of NLP that studies methods aimed at extracting information from text so that it can be used to populate a structured information repository, such as a relational database. IE is divided into several subtasks, each of which aims to extract different structures from text, such as entities, relations, or more complex structures such as ontologies. In this thesis the term "information extraction" is (somewhat arbitrarily) used to identify only the subtasks that are formulated as sequence labeling tasks. Recently, the main approaches by means of which IE has been tackled rely on supervised machine learning, which needs human-labeled data examples in order to train the systems that extract information from yet unseen data. When IE is tackled as a sequence labeling task (as in, e.g., named-entity recognition, concept extraction, and in some cases opinion mining), among the best-performing supervised machine learning methods are certainly probabilistic graphical models, and, specifically, Conditional Random Fields (CRFs). In this thesis we investigate two major aspects related to information extraction from text via CRFs: the creation of CRFs models that outperform the commonly adopted, state-of-the-art, "linear-chain" CRFs, and the impact of the quality of training data on the accuracy of CRFs systems for IE. In the first part of the thesis we use the capabilities of the CRFs framework to create new kinds of CRFs (i.e., two-stage, ensemble, multi-label, hierarchical) that, unlike the commonly adopted linear-chain CRFs, have a customized structure that fits the task taken into consideration.
We exemplify this approach on two different tasks, i.e., IE from medical documents and opinion mining from product reviews. CRFs, like any machine learning-based approach, may suffer if the quality of the training data is low. Therefore, the second part of the thesis is devoted to (1) the study of how the quality of the training data affects the accuracy of a CRFs system for IE; and (2) the production of human-annotated training data via semi-supervised active learning (AL).

See at: CNR ExploRA

2013 Journal article Open Access

An enhanced CRFs-based system for information extraction from radiology reports
Esuli A., Marcheggiani D., Sebastiani F.
We discuss the problem of performing information extraction from free-text radiology reports via supervised learning. In this task, segments of text (not necessarily coinciding with entire sentences, and possibly crossing sentence boundaries) need to be annotated with tags representing concepts of interest in the radiological domain. In this paper we present two novel approaches to IE for radiology reports: (i) a cascaded, two-stage method based on pipelining two taggers generated via the well-known linear-chain conditional random fields (LC-CRFs) learner and (ii) a confidence-weighted ensemble method that combines standard LC-CRFs and the proposed two-stage method. We also report on the use of "positional features", a novel type of feature intended to aid in the automatic annotation of texts in which the instances of a given concept may be hypothesized to systematically occur in specific areas of the text. We present experiments on a dataset of mammography reports in which the proposed ensemble is shown to outperform a traditional, single-stage CRFs system in two different, applicatively interesting scenarios.
Source: Journal of Biomedical Informatics 46 (2013): 425–435.
DOI: 10.1016/j.jbi.2013.01.006

See at: Journal of Biomedical Informatics Open Access | www.sciencedirect.com Restricted | CNR ExploRA

2012 Conference article Open Access

Metadata enrichment services for the Europeana digital library.
Berardi G., Esuli A., Gordea S., Marcheggiani D., Sebastiani F.
We demonstrate a metadata enrichment system for the Europeana digital library. The system allows the different institutions that provide Europeana with pointers to their content (in the form of metadata records, MRs) to enrich their MRs by classifying them under a classification scheme of their choice, and to extract/highlight entities of significant interest within the MRs themselves. The use of a supervised learning metaphor allows each content provider (CP) to generate classifiers and extractors tailored to the CP's specific needs, thus allowing the tool to be effectively available to the multitude (2000+) of Europeana CPs.
Source: Theory and Practice of Digital Libraries. Second International Conference, pp. 508–511, Paphos, Cyprus, 23-27 September 2012
DOI: 10.1007/978-3-642-33290-6_61

See at: nmis.isti.cnr.it Open Access | doi.org Restricted | link.springer.com | CNR ExploRA

2012 Conference article Open Access

ISTI@ TREC Microblog track 2012: real-time filtering through supervised learning
Berardi G., Esuli A., Marcheggiani D.
Our approach to the microblog filtering task is based on learning a relevance classifier from an initial training set of relevant and non-relevant tweets, generated by using a simple retrieval method. The classifier is then retrained using the (simulated) user feedback collected during the training process, in order to improve its accuracy as the filtering process goes on. In the official runs the system scored low effectiveness values, suffering a strong imbalance toward recall.
Source: TREC 2012 - 21st Text Retrieval Conference, Gaithersburg, US, 6-9 November 2012

See at: trec.nist.gov Open Access | CNR ExploRA

2011 Conference article Restricted

ISTI @ TREC Microblog Track 2011: Exploring the Use of Hashtag Segmentation and Text Quality Ranking
Berardi G., Esuli A., Marcheggiani D., Sebastiani F.
In the first year of the TREC Microblog track, our participation has focused on building from scratch an IR system based on the Whoosh IR library. Though the design of our system (CipCipPy) is pretty standard, it includes three ad-hoc solutions for the track: (i) a dedicated indexing function for hashtags that automatically recognizes the distinct words composing a hashtag, (ii) expansion of tweets based on the title of any referred Web page, and (iii) a tweet ranking function that ranks tweets in results by their content quality, which is compared against a reference corpus of Reuters news. In this preliminary paper we describe all the components of our system, and the efficacy scored by our runs. The CipCipPy system is available under a GPL license.
Source: 20th Text Retrieval Conference, TREC 2011, Gaithersburg, US, November 15-18 2011

See at: trec.nist.gov Restricted | CNR ExploRA

2011 Report Unknown

ASSETS - Specification of ingestion services
Baccianella S., Esuli A., Marcheggiani D., Sebastiani F., Gordea S.
This document contains a specification of the services to be developed within tasks "T2.1.1 Metadata cleaning", "T2.1.2 Knowledge extraction", and "T2.1.3 Metadata classification", all of them under the responsibility of CNR. For each activity, a scientific analysis and a detailed specification of the API level is provided.
Source: Project report, ASSETS, Deliverable D2.1.1, 2011
Project(s): ASSETS

See at: CNR ExploRA

2010 Journal article Unknown

Extracting information from free-text mammography reports
Esuli A., Marcheggiani D., Sebastiani F.
Researchers from ISTI-CNR, Pisa, aim at effectively and efficiently extracting information from free-text mammography reports, as a step towards the automatic transformation of unstructured medical documentation into structured data.
Source: ERCIM news 82 (2010): 60–61.

See at: CNR ExploRA

2010 Conference article Unknown

Sentence-based active learning strategies for information extraction
Esuli A., Marcheggiani D., Sebastiani F.
Given a classifier trained on relatively few training examples, active learning (AL) consists in ranking a set of unlabeled examples in terms of how informative they would be, if manually labeled, for retraining a (hopefully) better classifier. An important text learning task in which AL is potentially useful is information extraction (IE), namely, the task of identifying within a text the expressions that instantiate a given concept. We contend that, unlike in other text learning tasks, IE is unique in that it does not make sense to rank individual items (i.e., word occurrences) for annotation, and that the minimal unit of text that is presented to the annotator should be an entire sentence. In this paper we propose a range of active learning strategies for IE that are based on ranking individual sentences, and experimentally compare them on a standard dataset for named entity extraction.
Source: 1st Italian Information Retrieval Workshop, pp. 41–45, Padova, IT, 27-28 January 2010

See at: CNR ExploRA

2010 Conference article Unknown

ISTI@SemEval-2 Task 8: boosting-based multiway relation classification
Esuli A., Marcheggiani D., Sebastiani F.
We describe a boosting-based supervised learning approach to the "Multi-Way Classification of Semantic Relations between Pairs of Nominals" task #8 of SemEval-2. Participants were asked to determine which relation, from a set of nine relations plus "Other", exists between two nominals, and also to determine the roles of the two nominals in the relation. Our participation has focused, rather than on the choice of a rich set of features, on the classification model adopted to determine the correct assignment of relation and roles.
Source: 5th International Workshop on Semantic Evaluation, pp. 218–221, Uppsala, Sweden, 15-16 July 2010

See at: CNR ExploRA